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Abstract — We consider the estimation of an i.i.d. random vector 
observed through a linear transform followed by a component- 
wise, probabilistic (possibly nonlinear) measurement channel. A 
novel algorithm, called generalized approximate message passing 
(GAMP), is presented that provides computationally efficient ap- 
proximate implementations of max-sum and sum-problem loopy 
belief propagation for such problems. The algorithm extends 
earlier approximate message passing methods to incorporate 
arbitrary distributions on both the input and output of the 
transform and can be applied to a wide range of problems in 
nonlinear compressed sensing and learning. 

Extending an analysis by Bayati and Montanari, we argue 
that the asymptotic componentwise behavior of the GAMP 
method under large, i.i.d. Gaussian transforms is described 
by a simple set of state evolution (SE) equations. From the 
SE equations, one can exactly predict the asymptotic value 
of virtually any componentwise performance metric including 
mean-squared error or detection accuracy. Moreover, the analysis 
is valid for arbitrary input and output distributions, even 
when the corresponding optimization problems are non-convex. 
The results match predictions by Guo and Wang for relaxed 
belief propagation on large sparse matrices and, in certain 
instances, also agree with the optimal performance predicted by 
the replica method. The GAMP methodology thus provides a 
computationally efficient methodology, applicable to a large class 
of non- Gaussian estimation problems with precise asymptotic 
performance guarantees. 

Index Terms — Optimization, random matrices, estimation, be- 
lief propagation, graphical models, compressed sensing. 

I. Introduction 

The problem of estimating vectors from linear transforms 
followed by random, possibly nonlinear, measurements, arises 
in a range of problems in signal processing, communications, 
and learning. This paper considers a general class of such 
estimation problems in a Bayesian setting shown in Fig. [T] 
An input vector q £ Q n has components qj £ Q for some set 
Q and generates an unknown random vector x £ R n through 
a componentwise input channel described by a conditional 
distribution Px\Q( x j\Qj)- The vector x is then passed through 
a linear transform 

z = Ax, (1) 

where A £ W mxn is a known transform matrix. Finally, each 
component of Z{ of z randomly generates an output component 
yi of a vector y £ Y m through a second scalar conditional 
distribution PY\z(]Ji\ z i)i where Y is some output set. The 
problem is to estimate the transform input x and output z 
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from the system input vector q, system output y and linear 
transform A. 

The formulation is general and has wide applicability - we 
will present several applications in Section [TT| However, for 
many non-Gaussian instances of the problem, exact computa- 
tions of quantities such as the posterior mode or mean of x is 
computationally prohibitive. The primary difficulty is that the 
matrix A "couples" or "mixes" the coefficients of x into z. 
If the transform matrix were the identity matrix (i.e. m = n 
and A = /), then the estimation problem would decouple 
into m = n scalar estimation problems, each defined by the 
Markov chain: 

( \ PX\Q(Xj\qj) PY\z{Vj\ z o) 

Qj~PQ(Qj) — > x 3= z 3 — > Vj- 

Since all the quantities in this Markov chain are scalars, one 
could numerically compute the exact posterior distribution on 
any component Xj(= Zj) from qj and yj with one-dimensional 
integrals. However, for a general (i.e. non-identity) matrix A, 
the components of x are coupled into the vector z. In this case, 
the posterior distribution of any particular component Xj or Z{ 
would involve a high-dimensional integral that is, in general, 
difficult to evaluate. 

This paper presents a novel algorithm, generalized approx- 
imate message passing or GAMP, that decouples the vector- 
valued estimation problem into a sequence of scalar problems 
and linear transforms. The GAMP algorithm is an extension 
of several earlier Gaussian and quadratic approximations of 
loopy belief propagation (loopy BP) that have been successful 
in many previous applications, including, most recently, com- 
pressing sensing (TJ-0 (See the "Previous Works" subsection 
below). The proposed methodology extends these methods to 
provide several valuable features: 

• Generality: Most importantly, the GAMP methodology 
applies to essentially arbitrary priors Px\q an d out- 
put channel distributions Py\z- The algorithm can thus 
incorporate arbitrary non-Gaussian inputs as well as 
output nonlinearities. Moreover, the algorithm can be 
used to realize approximations of both max-sum loopy 
BP for maximum a posteriori (MAP) estimation and 
sum-product loopy BP for minimum mean- squared error 
(MMSE) estimates and approximations of the posterior 
marginals. 

• Computational simplicity: The algorithm is computation- 
ally simple with each iteration involving only scalar 
estimation computations on the components of x and z 
along with linear transforms. Indeed, the iterations are 
similar in form to the fastest known methods for such 
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problems [T0)-(T4), while being potentially more general. 
Moreover, our simulations indicate excellent convergence 
in a small number, typically 10 to 20, iterations. 
• Analytic tractability: Our main theoretical result, Claim [T] 
shows that for large Gaussian i.i.d. transforms, A, the 
asymptotic behavior of the components of the GAMP 
algorithm are described exactly by a simple scalar equiv- 
alent model. The parameters in this model can be com- 
puted by a set of scalar state evolution (SE) equations, 
analogous to the density evolution equations for BP 
decoding of LDPC codes (15), (16). From this scalar 
equivalent model, one can asymptotically exactly predict 
any componentwise performance metric such as mean- 
squared error (MSE) or detection accuracy. Moreover, 
the SE analysis generalizes results on earlier approximate 
message passing-like algorithms (l)-(9), anc ^ a ^ so ' m 
certain instances, show a match with the optimal per- 
formance as predicted by the replica method in statistical 
physics (8), (TTJ-gg. 

The GAMP algorithm thus provides a general and system- 
atic approach to a large class of estimation problems with 
linear mixing that is computationally tractable and widely- 
applicable. Moreover, in the case of large random A, the 
method admits exact asymptotic performance characteriza- 
tions. 



A. Prior Work 

The GAMP algorithm belongs to a long line of methods 
based on Gaussian and quadratic approximations of loopy 
belief propagation (loopy BP). Loopy BP is a general method- 
ology for estimation problems where the statistical relation- 
ships between variables are represented via a graphical model. 
The loopy BP algorithm iteratively updates estimates of the 
variables via a message passing procedure along the graph 
edges [22], [23]. For the linear mixing estimation problem, the 
graph is precisely the incidence matrix of the matrix A and the 
loopy BP messages are passed between nodes corresponding 
to the input variables Xj and outputs variables Z{. 

The GAMP algorithm provides approximations of two im- 
portant variants of loopy BP: 

• Max- sum loopy BP for approximately computing MAP 
estimates of x and z as well as sensitivities of the 
maximum of the posterior distribution to the individual 
components of x and z; and 

• Sum-product loopy BP for approximations of the min- 
imum mean-squared error (MMSE) estimates of x and 
z and the marginal posteriors of their individual compo- 
nents. 

We will call the GAMP approximations of the two algorithms, 
max- sum GAMP and sum-product GAMP, respectively. 

In communications and signal processing, BP is best known 
for its connections to iterative decoding in turbo and LDPC 
codes |24 ]— [26]. However, while turbo and LDPC codes 
typically involve computations over finite fields, BP has also 
been successfully applied in a number of problems with linear 
real-valued mixing, including CDMA multiuser detection (T), 
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Fig. 1. General estimation problem with linear mixing. The problem is to 
estimate the transform input and output vectors x and z from the system 
input and output vectors q and y, channel transition probability functions 
Px\q('\')i Py\z{'\') an d transform matrix A. 



(27), (28) , lattice codes (29) and compressed sensing [6], (30) , 
(3lj. 

Although exact implementation of loopy BP for dense 
graphs is generally computationally difficult, Gaussian and 
quadratic approximations of loopy BP have been widely and 
successfully used for many years. Such approximate methods 
originated in CDMA multiuser detection problems in (TJ- 
|5], and, more recently, have attracted considerable interest 
in compressed sensing (6)-(9), (20) , |2T) . The methods have 
appeared under a variety of names including "approximate 
BP", "relaxed BP" and, most recently, "approximate message 
passing". Gaussian approximations are also used in expecta- 
tion propagation (32) , (33) , as well as the analysis and design 
of LDPC codes (^-(36^ 

The GAMP algorithm presented here is most closely related 
to the approximate message passing (AMP) methods in [6), 
(37), (3|) as well as the relaxed BP methods in (4), (5J, 
(8). The AMP method |61 used a quadratic approximation 
of max-sum loopy BP to derive an efficient implementation 
of the LASSO estimator (39), (40). The LASSO estimate is 
equivalent to a MAP estimate of x assuming a Laplacian 
prior. This method was then generalized in (37), (38) to imple- 
ment Bayesian estimation with arbitrary priors using Gaussian 
approximations of sum-product loopy BP. The relaxed BP 
method in (5), (8) offers a further extension to nonlinear 
output channels, but the algorithm is limited to sum-product 
approximations of loopy BP and the analysis is limited to 
certain random, sparse matrices. 

The GAMP method proposed in this paper provides a 
unified methodology for estimation with dense matrices that 
can incorporate arbitrary input and output distributions and 
provide approximations of both max-sum and sum-product 
loopy BP. 

Moreover, similar to previous analyses of AMP-like algo- 
rithms, we argue that that the asymptotic behavior of GAMP 
has a sharp characterization for certain large, i.i.d. matrices. 
Specifically, for large random A, the componentwise behavior 
of many AMP techniques can be asymptotically described 
exactly by a set of state evolution (SE) equations. Such 
analyses have been presented for both large sparse matrices 
PL F)> @' @' an d' more recently, for dense i.i.d. matrices 
(6), Q. The validity of the SE analysis on dense matrices was 
particularly difficult to rigorously prove since conventional 
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density evolution techniques such as (HJ, |T6| need graphs 
that are locally cycle-free. The key breakthrough was an 
analysis by Bayati and Montanari in (7) that provided the 
first completely rigorous analysis of the dynamics of message 
passing in dense graphs using a new conditioning argument 
from [41 ]. That work also provided a rigorous justification of 
many predictions p7|-pT|, (42) of the behavior of optimal 
estimators using the replica method from statistical physics. 

The main theoretical contribution of this paper, Claim [T] can 
be seen as an extension of Bayati and Montanari 's analysis in 
(7) to GAMP - the novelty being the incorporation of arbitrary 
output channels. Specifically, we present a generalization of 
the SE equations to arbitrary output channels recovering 
several earlier results on AWGN channels as special instances. 
The SE equations also agree with the results in (5), (8) for 
relaxed BP applied to arbitrary output channels for large sparse 
matrices. 

The extension of Bayati and Montanari 's analysis in (7) to 
incorporate arbitrary output channels is relatively straightfor- 
ward. Unfortunately, a full re-derivation of their result would 
be long and beyond the scope of this paper. We thus only 
provide a brief of sketch of the main steps and our result is 
thus not fully rigorous. To emphasize the lack of rigor, we use 
the term Claim instead of Theorem. The predictions, however, 
are confirmed in numerical simulations. 

A conference version of this paper appeared in (43). This 
paper includes all the proofs and derivations along with more 
detailed discussions and simulations. 



B. Outline 

The outline of the remainder of this paper is as follows. As 
motivation, Section [TT] provides some examples of the linear 
mixing problems. The GAMP method is then introduced in 
Section lin| and precise equations for the max- sum and sum- 



product versions are given in Section IV The asymptotic 
state evolution analysis is presented in Section |V| and shown 
to recover many previous predictions of the SE analysis 
in several special cases discussed in Section |Vl| A simple 
numerical simulation to confirm the predictions is presented 



in Section VII which considers a nonlinear compressed sensing 
problem. Finally, since the original publication of the paper, 
there has been considerable work on the topic. The conclusions 
summarize this work and present avenues for future research. 

II. Examples and Applications 

The linear mixing model is extremely general and can be 
applied in a range of circumstances. We review some simple 
examples for both the output and input channels from [8]. 

A. Output Channel Examples 

a) AWGN output channel: For an additive white Gaus- 
sian noise (AWGN) output channel, the output vector y can 
be written as 

y = z + w = Ax + w, (2) 



where w is a zero mean, Gaussian i.i.d. random vector inde- 
pendent of x. The corresponding channel transition probability 
distribution is given by 



py\z{v\z) = 



l 



-. exp 



(y - *? 

2t w 



(3) 



where r w > is the variance of the components of w. 

b) Non-Gaussian noise models: Since the output channel 
can incorporate an arbitrary separable distribution, the linear 
mixing model can also include the model ^ with non- 
Gaussian noise vectors w, provided the components of w 
are i.i.d. One interesting application for a non-Gaussian noise 
model is to study the bounded noise that arises in quantization 
as discussed in [8]. 

c) Logistic channels: A quite different channel is based 
on a logistic output. In this model, each output ^ is or 1, 
where the probability that yi = 1 is given by some sigmoidal 
function such as 



PY\z(Vi l\Zi) 



1 + aexp(— Zi 



(4) 



for some constant a > 0. Thus, larger values of zi result in a 
higher probability that yi = 1. 

This logistic model can be used in classification problems 
as follows (44): Suppose one is given m samples, with each 
sample being labeled as belonging to one of two classes. Let 
yi = or 1 denote the class of sample i. Also, suppose that the 
ith row of the transform matrix A contains a vector of n data 
values associated with the ith sample. Then, using a logistic 
channel model such as Q, the problem of estimating the 
vector x from the labels y and data A is equivalent to finding 
a linear dependence on the data that classifies the samples 
between the two classes. This problem is often referred to 
as logistic regression and the resulting vector x is called the 
regression vector. By adjusting the prior on the components 
of x, one can then impose constraints on the components of 
x including, for example, sparsity constraints. 



B. Input Channel Examples 

The distributions Px\q( x j\Qj) models the prior on Xj, with 
the variable qj being a parameter of that prior that varies over 
the components, but is known to the estimator. When all the 
components of Xj are identically distributed, we can ignore qj 
and use a constant distribution px(xj)- 

d) Sparse priors and compressed sensing: One class of 
non-Gaussian priors that have attracted considerable recent 
attention is sparse distributions. A vector x is sparse if a 
large fraction of its components are zero or close to zero. 
Sparsity can be modeled probabilistically with a variety of 
heavy-tailed distributions px (%j ) including Gaussian mixture 
models, generalized Gaussians and Bernoulli distributions with 
a high probability of the component being zero. The estimation 
of sparse vectors with random linear measurements is the basic 
subject of compressed sensing |45)-|47| and fits naturally into 
the linear mixing framework. 
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e) Discrete distributions: The linear mixing model can 
also incorporate discrete distributions on the components of x. 
Discrete distribution arise often in communications problems 
where discrete messages are modulated onto the components 
of x. The linear mixing with the transform matrix A comes 
into play in CDMA spread spectrum systems and lattice codes 
mentioned above. 

f) Power variations and dynamic range: For multiuser 
detection problems in communications, the parameter qj can 
be used to model a priori power variations amongst the users 
that are known to the receiver. For example, suppose qj > 



almost surely and Xj 



with Uj being identically 



distributed across all components j and independent of qj. 
Then, E(x?\qj) = qjK(Uj) and qj thus represents a power 
scaling. This model has been used in a variety of analyses of 
CDMA multiuser detection (4), (T8), [48 ] and random access 
channels (491. 



III. Generalized Approximate Message Passing 

We begin with a description of the generalized approximate 
message passing (GAMP) algorithm, which is an extension of 
the AMP procedure in |6), (7). Similar to the AMP method, 
the idea of the algorithm is to iteratively decompose the vector 
valued estimation problem into a sequence of scalar operations 
at the input and output nodes. In the GAMP algorithm, the 
scalar operations are defined by two functions, g ut( m ) and 
<7in(*)> that we call the scalar estimation functions. We will see 



in Section [IV] that with appropriate choices of these functions, 
the GAMP algorithm can provide Gaussian and quadratic 
approximations of either sum-product and max-sum loopy BP. 

The steps of the GAMP method are shown in Algorithm [T] 
The algorithm produces a sequence of estimates, x(t) and z(t), 
for the unknown vectors x and z. The algorithm also outputs 
vectors r x (t) and r s (t). As we will see in the case of the sum- 
product GAMP (Section |IV-B| ), these have interpretations as 
certain variances. Although our analysis later is for real- valued 
matrices, we have written the equations for the complex case. 

A. Computational Complexity 

We will discuss the selection of the scalar estimation 
functions and provide a detailed analysis of the algorithm per- 
formance later. We first describe the algorithm's computational 
simplicity - which is a key part of the algorithm appeal. 

Each iteration of the algorithm has four steps. The first 
step, the output linear step, involves only a matrix-vector 
multiplications by A and |A| 2 , where the squared magnitude 
is taken componentwise. The worst case complexity is 0(mn) 
and would be smaller for structured transforms such as Fourier, 
wavelet or sparse. The next step — the nonlinear step — 
involves componentwise applications of the nonlinear output 
estimation function g ou t(-) on each of the m components of 
the output vector p. As we will see, the function g ou t(-) does 
not change with the dimension, so the total complexity of the 
output nonlinear step is complexity of 0(m). Similarly, the 
input steps involve matrix-vector multiplications by A T and 
(|A| 2 ) T along with componentwise scalar operations at the 



Algorithm 1 Generalized AMP (GAMP) 
Given a matrix A G R mxn , system inputs and outputs q and 
y and scalar estimation functions gm('), and gm( m ), generate 
a sequence of estimates x(£), z(t), for t = 0, 1, . . . through 
the following recursions: 

1) Initialization: Set t = and set xj(t) and T*(t) to some 
initial values. 

2) Output linear step: For each z, compute: 



rf(t) 
Pi(t) 

Zi{t) 



3 

^OijXj^-Tf^Siit-l), (5b) 
3 

^aijXjit) (5c) 



where initially, we take s(— 1) = 0. 

3) Output nonlinear step: For each i, 

si CO = 9out(t,Pi(t),yi,T?(t)) 

rm = 9 

4) Input linear step: For each j, 



(6a) 



0^9ont(t,Pi(t),yi,Tf(t)). (6b) 



rj(t) = 



5) Input nonlinear step: For each j, 



%(m) 
rf(m) 



^r^.q^rjit)) 
d 



(7a) 
(7b) 

(8a) 



^W^in^f.W,^,^^)). (8b) 



Then increment t = t + 1 and return to step 2 until a 
sufficient number of iterations have been performed. 



input. The complexity is again dominated by the transforms 
with a worst-case complexity of 0(mn). 

Thus, we see the GAMP algorithm reduces the vector- 
valued operation to a sequence of linear transforms and scalar 
estimation functions. The worst-case total complexity per 
iteration is thus 0(mn) and smaller for structured transforms. 
Moreover, as we will see in the state evolution analysis, the 
number of iterations required for the same per component per- 
formance does not increase with the problem size, so the total 
complexity is bounded by the matrix-vector multiplication. In 
addition, we will see in the simulations that good performance 
can be obtained with a small number of iterations, usually 10 
to 20. 

It should be pointed out that the structure of the GAMP 
iterations - transforms followed by scalar operations - underly 
many of the most of successful methods for linear-mixing type 
problems. In particular, the separable approximation method 
of (TO) and alternating direction methods in pT|-p4| are all 
based on iterations of this form. The contribution of the current 
paper is to show that specific instance of these transform + 
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separating algorithms can be interpreted as an approximation 
of loopy BP and admits precise asymptotic analyses. 



B. Further Simplifications 

To further reduce computational complexity, the GAMP 
algorithm can be approximated by a modified procedure shown 
in Algorithm [2] In the modified procedure, the variance 
vectors, r r (t) and r p (t), are replaced with scalars r r (t) and 
r p (t), thus forcing all the variance components to be the same. 
This approximation would be valid when the components of 
the transform matrix A are the approximately equal so that 
\dij\ 2 « \\A\\ 2 F /mn for all i, j. 

The approximations in Algorithm [2] can also be heuristically 
justified when A has i.i.d. components. Specifically, if A is 
i.i.d., m and n are large, and the dependence between the 
components of \Aij\ 2 and the vectors r x (t) and r s (t) can 
be ignored, the simplification in Algorithm [2] can be justified 
through the Law of Large Numbers. Of course, \Aij\ 2 is not 
independent of r x (t) and r s (t), but in our simulation below 
for an i.i.d. matrix, we will see very little difference between 
Algorithm [T] and the simplified version, Algorithm [2] 

The benefit of the simplification is that we can eliminate 
the matrix multiplications by |A| 2 and (|A| 2 ) T - reducing 
the number of transforms per iteration from four to two. 
This savings can be significant, particularly when the A 2 
and (A 2 ) T have no particularly simple implementation. The 
simplified algorithm, Algorithm |2j also more closely matches 
the AMP procedure of |6], and will be more amenable to 
analysis later in Section [V] 



IV. Scalar Estimation Functions to Approximate 
Loopy BP 

As discussed in the previous section, through proper se- 
lection of the scalar estimation functions gm(-) and <7out(*)> 
GAMP can provide approximations of either max- sum or sum- 
product loopy BP. With these selections, we can thus realize 
the two most useful special cases of GAMP: 

• Max-sum GAMP: An approximation of max- sum loopy 
BP for computations of the MAP estimates and compu- 
tations of the marginal maxima of the posterior; and 

• Sum-product GAMP: An approximation of sum-product 
loopy BP for computations of the MMSE estimates and 
the marginal posterior distributions. 

Heuristic "derivations" of the scalar estimation functions for 
both of these algorithms are sketched in Appendices [C] and 
iDl and summarized here. The selection functions are also 
summarized in Table [I] We emphasize that the approximations 
are entirely heuristic - we don't claim any formal properties 
of the approximation. However, the analysis of the GAMP 
algorithm with these or other functions will be more rigorous. 



A. Max-Sum GAMP for MAP Estimation 

To describe the MAP estimator, observe that for the linear 
mixing estimation problem in Section [TJ the posterior density 



Algorithm 2 GAMP with Scalar Variances 

Given a matrix A G R mxn , a system input and output vectors 

q and y, selection functions gi n ( m ), and gi n ( m ), generate a 

sequence of estimates x(t), z(t), for t = 0, 1, . . . through the 

recursions: 

1) Initialization: Set t = and let Xj(t) and r x (t) be any 
initial sequences. 

2) Output linear step: For each z, compute: 



r*(t) = (l/m)\\A\\ 2 F r x (t) 



(9a) 



Pi(t) = ^a^%(£)-r^)? z (£-l), (9b) 

3 

%{t) = ^2^ijXj(t) (9c) 



where initially, we take s(— 1) = 0. 
3) Output nonlinear step: For each i, 



= gout(t,Pi(t),yuT p (t)) 

T S (t) = 

4) Input linear step: For each j, 



(10a) 

1 m 

-Y,^out(tMt),Vi,T P (t)). (10b) 

m f— ' op 

i—l 



l/r r (t) = (l/n)||A||£r s (*) 



(Ha) 



,(t) = Xj^ + T^t^aijMt)- (Hb) 



5) Input nonlinear step: 

T X (t+l) = 



(12a) 



^ E |^in(i, rj(t), q 3 ,T r (t)). (12b) 
n or 
3=1 

Then increment t = t + 1 and return to step 2 until a 
sufficient number of iterations have been performed. 



of the vector x given the system input q and output y is given 
by the conditional density function 

Px|q, y (x|q, y) := z ^ ^ exp (F(x, Ax, q, y)) , (13) 

where 



^(x,z,q,y) := ^/ in ( ) + ^2fout(zi,Vi), (14) 

j=l i=l 



and 



fout(z,y) '= logpY\z(y\z) (15a) 
f- m (x,q) := \ogp x \ Q (x\q). (15b) 

The constant Z(q, y) in fl3] ) is a normalization constant. 
Given this distribution, the maximum a posteriori (MAP) esti- 
mator is the maxima of ( [T3] ) which is given by the optimization 

x map := argmaxF(x, z,q, y), z = Ax. (16) 
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Method 


Input scalar estimation functions 


Output scalar estimation functions 


9in(r,q,r r ) 


r r g{ n (r,q,T r ) 


flbut (p,y,r p ) 


-9out(p>y> rP ) 


Max-sum GAMP 

for MAP estimation 


arg max x F[ n (z, f, q, r r ) 


r r /(l-r r f[^q)) 


(z°-p)/rP 

z° := arg max z F out (z, p, y, r p ) 


C^tfJ/rat^y)-!) 


Sum-product GAMP 

for MMSE estimation 


E(x\r,q,r r ) 

R = x +A/"(0,T r ) 


var(x\r, q, r r ) 


z° :=E(z\p,y,TP) 

Y~P Ylz ,Z~Ar(p, T P) 


(rP(t)-^r(z\p,y))/(rP(t)) 2 


AWGN 


T x0 {r-q)/(T r +T x0 ) + q 
X =Af(q,T x0 ) 


T x0 T r /(r r +T x0 ) 


(y-p)/(r w +rP) 
Y = Z + J\f(0,T w ) 


1/(t w +tP) 



TABLE I 

Scalar input and output estimation functions max-sum and sum-product GAMP. For AWGN inputs and outputs, the scalar 

ESTIMATION FUNCTIONS FOR BOTH MAX-SUM AND SUM-PRODUCT ALGORITHMS ARE THE SAME AND HAVE A PARTICULARLY SIMPLE FORM. 



For each component j, we will also be interested in the 
marginal maxima of the posterior distribution 

Aj(xj) := maxF(x, z, q, y), z = Ax, (17) 

where the maximization is over vector x £ R n with a fixed 
value for the coefficient Xj. Note that the component Xj of 
the MAP estimate is given by Xj = arg max Aj(xj). This, 
Aj(xj) can be interpreted as the sensitivity of the maxima to 
the value of the component Xj. 

Note that one may also be interested in the optimization 
( [16] ) where the objective is of the form ( [T4] ), but the functions 
/in(-) and /out(') are not derived from any density functions. 
The max-sum GAMP method can be applied to these general 
optimization problems as well. 

Now, an approximate implementation of max-sum BP for 
the MAP estimation problem ( [16] ) is described in Appendix 
[C] It is suggested there that a possible input function to 
approximately implement max-sum BP is given by 



# in (f,<?,T r ) := argmaxF in (x,f,<2,T r ) (18) 



where 



F in (x, r, q, r r ) := / in (x, q)-—(f- xf . (19) 

Here, and below, we have dropped the dependence on the 
iteration number t when it is not needed. The Appendix also 
shows that the function ( [18] ) has a derivative satisfying 



T r JL#in(r,<?,T r ) 



(20) 



where the second derivative f[^(x,q) is with respect to x 
and x = g\n{r,q 1 T r ). Also, the marginal maxima ( [17] ) is 
approximately given by 



Aj(xj) w Finixj.rj.qj.rJ) + const, 



(21) 



where the constant term does not depend on xj. 

The initial condition for the GAMP algorithm should be 
taken as 

xj (0) = arg max / in ( Xj , q 5 ) , r/ (0) = — ) r . (22) 

Jinril u i! Qj) 

This initial conditions corresponds to the output of ([8]) with 
t = 0, the functions in ([18]) and ([20]) and r r (-l) oo. 



Observe that when f[ n is given by ( |15b| ), gi n (r,q,T r ) is 
precisely the scalar MAP estimate of a random variable X 
given Q = q and = r for the random variables 



R = X + V, V-A/"(0,r r ), 



(23) 



where X ~ Px\Q( x \q)> Q ~ Pq(q) an ^ wlt h V independent 
of X and Q. With this definition, R can be interpreted as a 
Gaussian noise-corrupted version of X with noise level r r . 

Appendix [C] shows that the output function for the approx- 
imation of max-sum BP is given by 

^out(p,2/,^):=4(^-p) 5 (24) 



where 



z :-- 



argmaxF out (2;,p,y,r p ), 



and 



^outO, P,y,r p ) := f ou t(z,y) 



2rP 



(z-p) 2 . 



The negative derivative of this function is given by 



(25) 



(26) 



(27) 



l-r^' ut (F,2/)' 

where the second derivative f" ut (z , y) is with respect to z. 

Now, when f out (z,y) is given by ( [Ha) , F out (z,p,y,r p ) 
in ( [26] ) can be interpreted as the log posterior of a random 
variable Z given Y = y and 

Z~A/-(p,T^), y ~py,z(2/k). (28) 

In particular, z° in ( [25] ) is the MAP estimate of Z. 

We see that the max-sum GAMP algorithm reduces the 
vector MAP estimation problem to a sequence of scalar 
MAP estimations problems at the inputs and outputs. The 
scalar parameters r r (t) and r p (t) represent effective Gaussian 
noise levels in these problems. The equations for the scalar 
estimation functions are summarized in Table U 

B. Sum-Product GAMP for MMSE Estimation 

The minimum mean squared error (MMSE) estimate is the 
conditional expectation 

X mmse :=E[x | y,q], (29) 
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relative to the conditional density ( [T3) . We are also interested 
in the log posterior marginals 



Aj(xj) := \ogp(xj\q,y). 



(30) 



The selection of the functions #in(-) and g ou t(-) to ap- 
proximately implement sum-product loopy BP to compute 
the MMSE estimates and posterior marginals is described in 
Appendix [D] Heuristic arguments in that section, show that 
the input function to implement BP-based MMSE estimation 
is given by 

£ in (?,4,r r ) :=E[X | R = r,Q = q], (31) 

where the expectation is over the variables in ( [23] ). Also, the 
derivative is given by the variance, 

rr ^9in(r, q, r r ) := var[X | R = f, Q = q]. (32) 

In addition, the log posterior marginal ( [30] ) is approximately 
given by §T\) . From the definition of F{ n (-) in ( [T9) we see 
that the posterior marginal is approximately given by 



pOj|q,y) 



l 



Px\Q(xj\qj)exp 



1 

'2r r 



(?j -Xjf 



(33) 



- z 

where Z is a normalization constant. 

The initial condition for the GAMP algorithm for MMSE 
estimation should be taken as 

%(0) = E(X | Q = q 3 ) 7-/(0) = var(X | Q = Qj ), (34) 

where the expectations are over the distribution Px\Q( x j\Qj)- 
Thus, the algorithm is initialized to the the prior mean and 
variance on Xj based on the parameter qj but no observations. 
Equivalently, the initial conditions ( [34] ) corresponds to the 
output of ([8]) with t = 0, the functions in ([31} and ( [32] ) and 
r r (-l) oo. 

To describe the output function g ou t(p,y,T p ), consider a 
random variable z with conditional probability density 

p(z\p,y,r p ) oc expF ou t(z,p,?/,r p ), (35) 

where F out (z,p,y,T p ) is given in ( |26) . The distribution ( [35] ) 
is the posterior density function of the random variable Z 
with observation Y in ( [28] ). Appendix [P] shows that the output 
function g ou t(p,y,T p ) to implement approximate BP for the 
MMSE problem is given by 

t (p,V,T p ):=\(zP-p), z° :=E(z\p,y,Tn, (36) 



9out{ 



T p 



where the expectation is over the density function ( [35] ). Also 
the negative derivative of g Q ut(') * s given by 

var(z\p,y,T p ) 



^oyit{P,y,T P ) = 



1 



(37) 



Appendix [D] also provides an alternative definition for 
#out(-) : The function # O ut(0 in ([36]) is given by 

d 

9out(p,y,T p ) := ^logp(?/|p,r p ), (38) 

where p(y\p, r p ) is the density is from the channel ( [28] ). As a 
result, its negative derivative is 



Hence g ut(p,y,T p ) has the interpretation of a score function 
of the parameter p in the distribution of the random variable 
Y in {58). 

Thus, similar to the MAP estimation problem, the sum- 
product GAMP algorithm reduces the vector MMSE estima- 
tion problem to the evaluation of sequence of scalar estimation 
problems from Gaussian noise. Scalar MMSE estimation is 
performed at the input nodes, and a score function of an ML 
estimation problem is performed at the output nodes. 

C. AWGN Output Channels 

In the special case of an AWGN output channel, we will see 
that that the output functions for max- sum and sum-product 
GAMP are identical and reduce to the update in the AMP 
algorithm of Bayati and Montanari in (7). Suppose that the 
output channel is described by the AWGN distribution ^ for 
some output variance r w > 0. Then, it can be checked that 
the distribution p(z\p,y,r p ) in ([35]) is the Gaussian 



p(z\p,y,rP)^m^,r z ), 



where 



T p 



(y-p), 



(40) 

(41a) 
(41b) 



It can be verified that the conditional mean z° in ( |41a| ) agrees 
with both z° in ( [25] ) for the MAP estimator and z° in ( [36] ) for 
the MMSE estimator. Therefore, the output function g out for 
both the MAP estimate in ([24) and MMSE estimate in ([36) is 
given by 

y-p 



9out(p,y,T p ) 



T p 



(42) 



(43) 



The negative derivative of the function is given by 

Op T p + T w 

Therefore, for both max-sum and sum-product GAMP 
gout{p,y-,T p ) and its derivative are given by ( |42) and ([43). 
When, we apply these equations into the input updates ([5), 
we precisely obtain the original AMP algorithm (with some 
scalings) given in (7). 



D. AWGN Input Channels 

Now suppose that the input density function Px\q( x \q) is 
a Gaussian density 

Px\q( x \q) =J\T(q,T x0 ), 

for some variance r x0 > 0. Then, it is easily checked that the 
input estimate function ^m(-) and its derivative are identical 
for both max-sum and sum-product GAMP and given by 



_9 
dp 



.gout(p,y,r p ) 



dp 2 



#in(r,<2,r r ) := 

T gin( r ,Q, T ) : = 



r x0 



:{r-q) + q (44a) 



(44b) 



log p(y\p, r p ). (39) The functions are shown in Xable m 
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V. Asymptotic Analysis 

We now present our main theoretical result, which is the 
SE analysis of GAMP for large, Gaussian i.i.d. matrices A. 

A. Empirical Convergence of Random Variables 

The analysis is a relatively minor modification of the results 
in Bayati and Montanari's paper [7|. The work (7) employs 
certain deterministic models on the vectors and then proves 
convergence properties of related empirical distributions. To 
apply the same analysis here, we need to review some of 
their definitions. We say a function </> : R r — >> R s is pseudo- 
Lipschitz of order k > 1, if there exists an L > such for any 
x, y G R r , 

||^(x) - 0(y)|| < L(l + ||xf - 1 + ||yf -^Hx - y||. 

Now suppose that for each n = 1,2,..., v^ n ) is a block 
vector with components v- n ^ G R s , i = 1,2, ...,£(n) 
for some £(n). So, the total dimension of v( n ) is s£(n). 
We say that the components of the vectors v^ n ) empirically 
converges with bounded moments of order k as n — » oo to a 
random vector V on R s if: For all pseudo-Lipschitz continuous 
functions, <f>, of order k, 

i{n) 

lim ^E^K (n) ) = E w v )) <o °- 

When the nature of convergence is clear, we may write (with 
some abuse of notation) 

lim v< n) = V. 

n— )-oo 

B. Assumptions 

With these definitions, we can now formally state the 
modeling assumptions, which follow along the lines of the 
asymptotic model considered by Bayati and Montanari in (T). 
Specifically, we consider a sequence of random realizations of 
the estimation problem in Section [T] indexed by the input signal 
dimension n. For each n, we assume the output dimension 
m = m(n) is deterministic and scales linearly with the input 
dimension in that 

lim n/m(n) = /3, (45) 

n— )-oo 

for some f3 > called the measurement ratio. We also assume 
that the transform matrix A G R mxn has i.i.d. Gaussian 
components ~ A/"(0, 1/ra) and z = Ax. 

We assume that for some order k > 2, the components 
of initial condition x(0), r x (0) and input vectors x and q 
empirically converge with bounded moments of order 2k — 2 
as 

lim (%(0),r/(0),^, % ) ^ (X ,r*(0),X,Q), (46) 

for some random variable triple (Xq, X, Q) with joint density 
Px x q(^o? ^ q) and constant r x (0). 

To model the dependence of the system output vector y on 
the transform output z, we assume that, for each n, there is a 
deterministic vector w G W m for some set W, which we can 



think of as a noise vector. Then, for every i = 1, . . . , m, we 
assume that 

y i = h(z h w i ) (47) 

where h is some function h : R x W —> Y and Y is the output 
set. Finally, we assume that the components of w empirically 
converge with bounded moments of order 2k — 2 to some 
random variable W with distribution pw(w). We will write 
Py\z(u\ z ) f° r me conditional distribution of Y given Z in this 
model. 

We also need certain continuity assumptions. Specifically, 
we assume that the partial derivatives of the functions 
g in (t,r,q,T r ) and g out (t,p,h(z,w),r p ) with respect to f, p 
and z exist almost everywhere and are pseudo-Lipschitz con- 
tinuous of order k. This assumption implies that the functions 
9in(ti q, r r ) and g ou t(t : p, y, r p ) are Lipschitz continuous in 
r and p respectively. 

C. State Evolution Equations 

Similar to (7), the key result here is that the behavior of 
the GAMP algorithm is described by a set of state evolution 
(SE) equations. For the GAMP algorithm, the SE equations are 
easiest to describe algorithmically as shown in in Algorithm 
[3] To describe the equations, we introduce two random vectors 
- one corresponding to the input channel, and the other 
corresponding to the output channel. At the input channel, 
given a r , £ r G R, define the random variable triple 

e r (r,a r ):=(X,Q,fl), (48) 

where the distribution (X, Q) are derived from the density in 
( |46| ) and R is given 

R = a r X + V, V~A/X(U r ), (49) 

with V independent of X and Q. At the output channel, given 
a covariance matrix K p G R 2x2 , K p > 0, define the four- 
dimensional random vector 

ffP(K?):={Z,P,W,Y), Y = h(Z,W), (50) 

such that W and (Z, P) are independent with distributions 

(Z,P)~A/-(0,K7), W~pw(w). (51) 

Since the computations in ( [53] ) and ( [54] ) are expectations over 
scalar random variables, they can be evaluated numerically 
given functions g in (-) and g out (-). 

D. Main Result 

To simplify the analysis, we consider the GAMP method 
with scalar variances, Algorithm El Our simulations indicate 
no difference between Algorithms~[l] and [2] at moderate block 
lengths. We also assume the following minor modifications 

• The variance r r (t) in (T\) is replaced by its deterministic 
limit r r (t)\ 

• The variance r p (t) in ^ is replaced by its deterministic 
limit r p (t); and 

• The norm || A|||i is replaced by its expectation E|| A|||, = 
n. 



RANGAN 



9 



Algorithm 3 GAMP State Evolution 



Given scalar estimation functions gm(') and <7out(*)> me den- 
sity function Px xq anc * initial value 7^(0) in ( |46| ), the 
density function py\z at the output and the measurement ratio 
P = m/n, compute the state evolution parameters as follows: 
1) Initialization: Set t = 0, let r x (0) be the initial value in 
d46l) and set 



K*(0) = cov(X,X(0)), 



(52) 



meaning the covariance matrix of the random variables 
(X,X(Q)) in the limit g5J. 
2) Output node update: Compute 



r p it) 
r r (t) 



pT x (t), K p (t) = /3K x (t) (53a) 



-E" 



d 



3out (t,P,Y,TP(t)) 



(r(t)) 2 E g 2 out (t,P,Y,T?(t)) 



(53b) 
(53c) 



where the expectations are over the random variable 
triples p (K p (t)) = (Z,P,W,Y) in <(50b. Also, let 



a r (t) = T r (t) 
d 



E 



dz 



g out (t,P,h(z,W),TP(t)) 



. (53d) 



3) Input node update: Compute 



T X (t+l) 



T r (t)E 



d_ 
dr 



3in (t,Q,R,T r (t)) (54a) 



K x (t+l) = cov(X,X(m)) 



(54b) 



where the expectation is over the random variable triple 

r (£ r (t),a r (t)) = (X,Q,R) in d48l, and 



X(t+l)=g in (t,Q,R,T r (t)). 
Increment t = t + 1 and return to step 2. 



Scalar input 
channel 



v(t)~N(0,T r (t)) 
AWGN noise Scalar 
estimator 



Px\Q( x j^j) 



a(t) 



n(t) 
— ► 



■PqQ 



AMP estimate 
— ► Xj(t) 



Fig. 2. Scalar equivalent model. The joint distribution of any one component 
xj and its estimate Xj(t) from the t\h iteration of the AMP algorithm is 
identical asymptotically to an equivalent estimator with Gaussian noise. 



(c) The components of the vectors z, p, w and y empirically 
converge with bounded moments of order k as 



lim (zi,pi(t),Wi,yi 



(K p (t)), 



(57) 



where p (K p (t)) = (Z,P,W,Y) is the random vector 
in < [5Q] ) and K p (t) is given by the SE equations. 

A proof of the claim is sketched in Appendix [A] We use 
the term "claim" here since the proof relies on an extension 
of a general result from [7 ] to vector-valued quantities. Due 
to space considerations, we only sketch a proof of the exten- 
sion. The term "claim", as opposed to "theorem," is used to 
emphasize that the details have been omitted. 

A useful interpretation of Claim [T] is that it provides 
a scalar equivalent model for the behavior of the GAMP 
estimates. The equivalent scalar model is illustrated in Fig. [2] 
To understand this diagram, observe that part (b) of Claim [T] 
shows that the joint empirical distribution of the components 
(xj,qj,rj(t)) converges to # r (£ r (t), a r (t)) = (X,Q,R) in 
( [48] ). This distribution is identical to rj(t) being a scaled and 
noise-corrupted version of Xj. Then, the estimate Xj(t) = 
9in(t,rj(i),qj,Tj(i)) is a nonlinear scalar function of rj(t) 
and qj. A similar interpretation can be drawn at the output 
nodes. 



These simplifications are likely not significant, since we will 
show that, under these simplifications, for all t, r r (t) — » r r (t) 
and r p (t) — » r p (t). Also, since A has i.i.d. components with 
variance 1/ra, (l/n)E||A|| 2 — )► 1. Moreover, it is possible 
that one could formally justify the simplification with the 
arguments in |50| . 

Claim 1: Consider the GAMP with scalar variances, Algo- 
rithm [2| under the assumptions in Section V-B and with the 
above modifications. Then, for any fixed iteration number t: 

(a) Almost surely, we have the limits 

lim r r (t) = r r (t), lim r p (t) = r p (t). (55) 

n— )-oo n— )>oo 

(b) The components of the vectors x, q, f and x empirically 
converge with bounded moments of order k as 

lim (x^qj^t)) = 6 r (e(t),a r (t)), (56) 

n— )*oo 

where fT(<f (f), a r (t)) = (X,Q,R) is the random vari- 
able triple in (|48]> and £, r (t) is from the SE equations. 



VI. Special Cases 
We now show that several previous results can be recovered 



from the general SE equations in Section V-C as special cases. 



A. AWGN Output Channel 

First consider the case where the output function ( [47] ) is 
given by additive noise model 



Ui h(zi,Wi) = Zi + Wi. 



(58) 



Assume the components Wi of the output noise vector empir- 
ically converge to a random variable W with zero mean and 
variance r w > 0. The output noise W need not be Gaussian. 
But, suppose the GAMP algorithm uses an output function 
<7out(') m |42]) corresponding to an AWGN output for some 
postulated output variance t^ os ^. that may differ from t w . This 
is the scenario analyzed in Bayati and Montanari's paper (Tj. 

Substituting ([43} with the postulated noise variance r™ ost 
into (|53b|) we get 



r(t) 



' post 



r p (t) 



' post 



(59) 
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Also, using d42j) and ([58]), we see that 



_9 

dz 



]out(p,h(z,y),r p (t)) 



d z + w — p 



1 



Therefore, ( |53d| ) implies that a r (t) = 1. 
Now define 



[1 -l]K*(f) 



1 



-1 



(60a) 
(60b) 



With these definitions, note that if (Z,P) ~ Af(0,K p (t)) 9 
then 

= E(Z-P) 2 . (61) 



(62) 



Also, with K x (t) defined in ( |54b| ), 

^(t+l) = E(X-X(£)) 2 
= E 

Therefore, 

= (r r (t+l)) 2 

(6) 



E(Y - P) 2 



(^ost+^(^+l)) 2 

- E(y - p) 2 

= E(W + Z - P) 2 = r w + E(Z - P) 2 

= r"+e P (t+i) 

(e) 



•/3E 



X-£ in (t,i?,Q,r r (t)) 



(63) 



where (a) follows from substituting ([42]) into ( |53c| ); (b) follows 
from ( [59] ); (c) follows from the output channel model assump- 
tion ( [58] ); (d) follows from ( [6T) and (e) follows from ( [60] ) 
and (|62^ The expectation in ([63) is over r (f r (£), a r (£)) = 
(X,Q,R) in ( |48) . Since a r (t) = 1, the expectation is 
over random variables (X,Q,R) where X ~ Px|qOe|(?)> 
Q ~PQ(q) and 



(64) 



£ = x + f, 7^(o,r(t)), 



and V is independent of X and Q. With this expectation, ( [63] ) 
is precisely the SE equation in [ 7 ] for the AMP algorithm with 
a general input function #i n (-)- 

5. Sum-Product GAMP with AWGN Output Channels 

The SE equation ( [63) applies to a general input function 
9m(')- Now suppose that the estimator uses an input function 
<7i n (*) based on the sum-product estimator ( |3T) . To account 
for possible mismatches with the estimator, suppose that the 
sum-product GAMP algorithm is run with some postulated 
distribution P^q(-) that may differ from the true distribu- 
tion Px\q(')- Let x^^ se (r : q : r r ) be the corresponding scalar 
MMSE estimator 

Sg£L(f, ^ := E = f, Q = g) , 

where the expectation is over the random variable 



R = X + V, V~A/"(0,T r ), 



where X given Q follows the postulated distribution 
Px\q( x \q) anc * ^ * s independent of X and Q. Let 
^mmse(Mj rr ) denote the corresponding postulated variance. 
Then, ( [3T) and ( [32) can be re-written 



g- m (r,q,T r )=x^ se (r,q,T r ), 



(66a) 



and 



r r ^5in(r, q, r r ) = £^L(f, q, r r ). (66b) 



With this notation, 



^(t+l) ( =^+^(t+l) 



< ost +^E[^ se ^^^W)] (67) 



where (a) follows from ([59) and (b) follows from substituting 
([66b) into ([54a). Also, substituting ( |66a| ) into ( [63) , we obtain 

l 2 



nm) = r- + /3E x-^ se ^,Q,r(t)) 



(68) 



The fixed point of the updates ( [67] ) and ( [68] ) are precisely 
the equations given by Guo and Verdu in their replica analysis 
of MMSE estimation with AWGN outputs and non-Gaussian 
priors [18]. Their work uses the replica method from statistical 
physics to argue the following: Let x post be the exact MMSE 
estimate of the vector x based on the postulated distributions 
P V x\q anc * postulated noise variance Tp OS ^. By exact, we 
mean the actual MMSE estimate for that postulated prior 
- not the GAMP or any other approximation. Then, in the 
limit of large i.i.d. random matrices A, it is claimed that the 
joint distribution of (xj^qj) for some component j and the 
corresponding exact MMSE estimate x^ ost , follows the same 
scalar model as Fig. [2] for the GAMP algorithm. Moreover, the 
effective noise variance of the exact MMSE estimator is a fixed 
point of the updates ( [67) and ( [68) . This result of Guo and Veru 
generalized several earlier results including analyses of linear 
estimators in |5T| and [48 ] based on random matrix theory 
and BPSK estimation in |17| based on the replica method. 
As discussed in [7], the SE analysis of the AMP algorithm 
provides some rigorous basis for the replica analysis, along 
with an algorithm to achieve the performance. 

Also, when the postulated distribution matches the true 
distribution, the general equations ( [67) and ( [68) reduce to the 
SE equations for Gaussian approximated BP on large sparse 
matrices derived originally by Boutros and Caire [1] along 
with other works including [4] and [52]. In this regard, the 
analysis here can be seen as an extension of that work to 
handle dense matrices and the case of possibly mismatched 
distributions. 

C Max-Sum GAMP with AWGN Output Channels 

Now suppose that the estimator uses an input function g{ n (-) 
based on the MAP estimator ([18) again with a postulated 
distribution P^q(-) that may differ from the true distribution 
Px\q(')- Let x^^(r,q,r r ) be the corresponding scalar MAP 
estimator 



(65) 



^map( r > 9i T ) '•- 



arg max 



fm(x,q) - 2~^(r- 



xf 



(69) 
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where 



f- m (x,q) = \ogp^(x\q). 



With this definition, xg^* is the scalar MAP estimate of the 
random variable X from the observations R = r and Q = q 
in the model ( [65] ). Also, following p0| ) define 



where J = a^ap- Then, ([T8j and ( f20| > can be re- written as 



g in (r,q,T r )=x^(r,q,T r ), 



and 



. d_ 



(70a) 



(70b) 



Following similar computations to the previous subsection, we 
obtain the SE equations 



r r (t+l) 

r(*+i) 



' post 



•/3E 



££S5(fl,Q,^(t)) 



•/?E 



X-2^(5,Q,r(f)) 



(71a) 
'(71b) 



Similar to the MMSE case, the fixed point of these equations 
precisely agree with the equations for replica analysis of MAP 
estimation with AWGN output noise given in (53) and related 
works in (19). Thus, again, the GAMP framework provides 
a rigorous justification of the replica results along with an 
algorithm to achieve the predictions by the replica method. 

D. General Output Channels with Matched Distributions 

Finally, let us return to the case of general (possibly non- 
AWGN) output channels and consider the sum-product GAMP 



algorithm with the estimations functions in Section |IV-B| In 
this case, we will assume that the postulated distributions for 
Px\q an d Py\z match the true distributions. Guo and Wang 
in (5| derived SE equations for standard BP in this case for 
large sparse random matrices. The work [ 8] derived identical 
equations for a relaxed version of BP. We will see that the 
same SE equations will follow as a special case of the more 
general SE equations above. 

To prove this, we will show by induction that, for all t, 
K. x (t) has the form 



K x (t) = 



r x0 



r x (t) r 



f x0 
xO 



(72) 



where t x0 is the variance of X. For t = 0, recall that for 
MMSE estimation, r x (0) = t x0 and %(0) = (the mean of 
X) for all j. Therefore, ([52]) shows that ( [72] ) holds for t = 0. 

Now suppose that ( |72| ) holds for some t. Then, ( |53a| ) shows 
that K p (t) has the form 



K p (t) 



r z0 



r z0 



r z0 
r z0 



(73) 



where r z0 = /3t x0 . Now, if (Z, P) ~ Af(0, K p (t)) with K^(t) 
given by j73]), then the conditional distribution of Z given P is 
Z ~ Af(P, T p {t)). Therefore, g ou t(') in ((38J can be interpreted 

d 



g ou t(p,y,T p ) := — logp y| p(y|P = p), 



(74) 



where p y |p(-|-) is the likelihood of Y given P for the random 
variables 6 p (K. p (t)) in 

Now let F(r p (i)) be the Fisher information 



F{T p {t)) := E 



d 



M \ogp nP {Y\P) 



(75) 



where the expectation is also over 9 p {KP{t)) with K p (t) given 
by ( |73"1 >. A standard property of the Fisher information is that 



F(T p (t)) = E 



& 



^logp Yl p(Y\P) 



Substituting <|75j and into Q we see that 

r(t) = at)= 1 



F(T p (t)Y 
We will show in Appendix [E] that 

d 



(76) 



(77) 



E 



(/ ollt (t,h(Z,W),P,TP(t)) 



E 



dz 



g out (t,h(Z,W),P,T p (t)) 



It then follows from ( |53d| ) that 

a r (r;) = 



1. 



(78) 



(79) 



Now consider the input update ( |54| ). Since a r (7;) = 1 
and r r (t) — £ r (t), the expectation in ( [3T] ) agrees with the 
expectations in ( [54] ). Substituting ([32]) into ( |54a| ), we obtain 



(80) 



r x (7; + l) =E var(X|Q,P) 



where the expectation is over 6 r (i) , a r (t)) in ( [48] ) with 
a = 1. Also, if X(t+1) = E(X\Q,R) then 



E(X-X(t; + 1)) 2 



E 



var(X\Q,R) =r x (t+l) 



X(t+l)(X-X(t+l)) 



0. 



These relations, along with the definition r x0 = E(X 2 ), show 
that the covariance K. x (t+1) in ( |54b| ) is of the form ( [72] ). This 
completes the induction argument to show that ( [72 ) and the 
other equations above hold for all t. 

The updates ( [77] ) and ( [80] ) are precisely the SE equations 
given in (5) and (8) for sum-product BP estimation with 
matched distributions on large sparse random matrices. The 
current work thus shows that the identical SE equations hold 
for dense matrices A. In addition, the results here extend 
the earlier SE equations by considering the cases where the 
distributions postulated by the estimator may not be identical 
to the true distribution. 

VII. Nonlinear Compressed Sensing Example 

To validate the GAMP algorithm and its SE analysis, we 
considered the following simple example of a nonlinear com- 
pressed sensing problem. The distribution on the components 
of the input, x, is taken as a Bernoulli-Gaussian: 







prob = 1-/9, 



Xj ~ ^ A/"(0, 1) prob = p, 



(81) 
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Fig. 3. Sigmoidal output function for the nonlinear compressed sensing 
example, along with a scatter plot of the noisy points (zi,m) in one typical 
realization. 
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Fig. 4. Normalized squared error of the of the sparse vector estimates in 
the nonlinear compressed sensing problem. Plotted are the median squared 
errors for Monte Carlo trials of the GAMP algorithm using both a linear 
approximation of the output (Lin-GAMP) and the true nonlinear output (NL- 
GAMP). Also plotted are state evolution (SE) predictions of the asymptotic 
performance. 



where, in the experiments below, we set p = 0.1. The distri- 
bution ( [81) provides a simple model of a sparse vector with 
p being the fraction of non-zero components. To match the 
state evolution (SE) theory, the linear transform A G W 71 x n is 
generated as an i.i.d. Gaussian matrix with ~ A/*(0, l/n). 
For the output channel, we consider nonlinear measurements 
of the form 

Vi = f(zi) + w u f(z) = 1/(1 + exp(-a*)), (82) 

and Wi ~ A/*(0, r w ). The function f(z) in ( |82| ) is a sigmoidal 
function that arises commonly in neural networks and provides 
a good test of the ability of GAMP to handle nonlinear 
measurements. The output variance is taken as r w = 0.01, 
or 20 dB below the full range of the output of f(z). The scale 
factor in ([82]) is set to a = 6.1. The output function f(z) 
along with a scatter plot of the noisy points (z^yi) is shown 
in Fig. [3] Since we are recovering a sparse vector x from a 
linear transform followed by noisy, nonlinear measurements, 
we can consider the problem a nonlinear compressed sensing 



problem. We set (m, n) = (500, 1000) so that the sparse vector 
is undersampled by a factor of two. 

For this problem, we consider two estimators, both based 
on the GAMP algorithm: 

• NL-GAMP: The sum-product GAMP algorithm matched 
to the true input and output distributions. 

• Lin-GAMP: The sum-product GAMP algorithm matched 
to the true input distribution, but the estimator assumes 
a linear channel of the form 

Vi = f(0)zi + w u Wi ~ A/*(0, t w ). 

Since the estimator Lin-GAMP assumes an AWGN output 
channel, it is equivalent to the Bayesian-AMP estimator of 
(37), [38]. For both algorithms, we use the full algorithm, 
Algorithm [T] Other simulations, not reported here, show 
virtually no difference to the simplified algorithm, Algorithm 

El 

Fig.[4]plots the normalized squared error for both algorithms 
over 100 Monte Carlo simulations. In each Monte Carlo 
simulation we computed the normalized squared error (NSE), 

NSE = 101og 10 (E||x-x|| 2 /E||x|| 2 ), 

where the expectation is over the components of the vectors. 
Fig. [4] then plots the median normalized squared error over 
the 100 trials. The median is used since there is a significant 
variation from between trials, and the median removes outliers. 

The first point to notice is that there is a significant gain 
from incorporating the nonlinearity in the output relative to 
using simple linear approximations. Indeed, NL-GAMP shows 
an asymptotic gain of over 11 dB relative to the Lin-GAMP 
method that approximates the output as a linear function. 
Secondly, we see that both algorithms converge very quickly, 
within approximately 10 to 15 iterations. 

Finally, we see that for both methods, the state evolution 
(SE) analysis predicts the per iteration performance extremely 
well. For Lin-GAMP, the SE predicts the median SE within 
0.2 dB and for NL-GAMP, the prediction is within 0.4 dB. 
Although the SE analysis theoretically only provides predic- 
tions for a modified version of the simplified Algorithm [2] we 
see that the prediction holds, at least in this simulation, for 
the full algorithm, Algorithm [T] 

Also note that since NL-GAMP is the sum-product GAMP 
algorithm where the postulated and true distributions are 
matched, the SE equations fall under the special case given 



in Section VI-D The SE equations in this special case were 
derived for sparse matrices in (5) and (8). However, since 
the Lin-GAMP is applied to a nonlinear output where the 
true and postulated distributions are not matched, there are 
no previous SE analyses that can predict the performance of 
the method. Interestingly, the squared error for Lin-GAMP 
does not monotonically decrease; There is a slight increase 
in squared error after iteration 7 and the increase is actually 
predicted by the SE analysis. 

The simulation thus demonstrates that the GAMP method 
can provide a tractable approach to a difficult nonlinear 
compressed sensing problem with significant gains over AMP 
methods based on naive linear approximations. Moreover, 
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the SE analysis can precisely predict the performance of the 
method and quantify the value of incorporating the nonlinear- 
ity. All code for the simulation is available in the sourceforge 
GAMP repository (54). 



Conclusions and Future Work 

We have considered a general linear mixing estimation 
problem of estimating an i.i.d. (possibly non-Gaussian) vector 
observed through a linear transform followed by a com- 
ponentwise (possibly random and nonlinear) measurements. 
The formulation is extremely general and encompasses many 
problems in compressed sensing, classification, learning and 
signal processing. We have presented a novel algorithm, called 
GAMP, that unifies several earlier methods to realize compu- 
tationally efficient and systematic approximations of max-sum 
and sum-product loopy BP. Our main theoretical contribution 
is an extension of Bayati and Montanari's state evolution (SE) 
analysis in (7j to precisely describes the asymptotic behavior 
of the GAMP algorithm for large Gaussian i.i.d. matrices. 
The GAMP algorithm thus provides a computationally simple 
method that can apply to a large class of estimation problems 
with provable exact performance characterizations. As a result, 
we believe that the GAMP algorithm can have wide ranging 
applicability. Indeed, applications of the GAMP methodology 
have been used in nonlinear wireless scheduling problems 
(55) , compressed sensing with quantization (56), and neural 
estimation (57) to name a few. 

Nevertheless, there are a large number of open issues for 
future research, some of which have been begun consideration 
since the original publication of this paper in (58) . 

Non-Gaussian matrices: Our SE analysis currently only 
applies to Gaussian i.i.d. matrices and a significant open ques- 
tion is whether the analyses can be extended to more general 
matrices. Recently, [59] extended the replica analysis in (T7) , 
(T8) to obtain predictions of the behavior of optimal estimators 
for linear mixing problems with general free matrices. One 
avenue of future research is to see if a AMP-like algorithm 
can be constructed that can provably obtain the performance 
predicted for these matrices. 

Learning Distributions: One of the main limitations of 
the GAMP method is that both the sum-product and max-sum 
variants require that the true distributions on the input an out- 
put channels are known. When the distributions are not known, 
one approach is to find a minimax estimator over a class of 
distributions, as was performed for classes of sparse priors in 
(60) . A second approach is to attempt to adaptively estimate 
the distributions assuming some parametric model. One of the 
most promising methods in this regard is to combine GAMP 
with expectation-maximization (EM) estimation as discussed 
in (20), (21), (50), (61), (62). 

Connections to graphical models: Since the GAMP 
method is derived from graphical model techniques, it can 
be incorporated as a component of a larger graphical model 
where the approximate message passing is combined with 
standard belief propagation updates. Such hybrid approaches 
have been explored in a number of works including (63)- 
(67) for extending compressed sensing problems with other 



statistical dependencies between components or estimation of 
additional latent variables. 

Optimality: A significant outstanding theoretical issue is 
to obtain a performance lower bound. An appealing feature of 
the SE analysis of sparse random matrices such asjfT), (4), 
(5), (8) as well as well as the replica analysis in (T7] T (18| , 
|42) is that they provide lower bounds on the performance of 
the optimal estimator. These lower bounds are also described 
by SE equations. Then, using a sandwiching argument - a 
technique used commonly in the study of LDPC codes (68) - 
shows that when fixed points of the SE equations are unique, 
BP is optimal. Finding a similar lower bound for the GAMP 
algorithm is a possible avenue of future research. 
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Appendix A 
Vector-Valued AMP 

A. SE Equations for Vector-Valued AMP 

The analysis of the GAMP requires a vector- valued version 
of the recursion analyzed by Bayati and Montanari. Fix 
dimensions and n^, and let 9 n and Q v be any two 
sets. Let G in (£, d, 6> n ) G R nb and G out (t,b,6 v ) G R nd be 
vector- valued functions over the arguments t = 0,1,2,..., 



b G R n \ d G R nd , U G 6 n and G <d v . Assume 
Gi n (£, d, U ) and G ou t(t, b, V ) are Lipschitz continuous in d 
and b, respectively. Let A G W mxn and generate a sequence 
of vectors bi(t) and dj(t) by the iterations 

bi(t) = ^OijUjit)- A(f) Vi (t-l), (83a) 

3 

dj(t) = X}°« v <(*)-*(*) u i(*) 



(83b) 



where 



and 



Uj(t+1) = G in (t,dj(t),6J), (84a) 
v*(t) = Gout (*,bi(t) ,0?). (84b) 



1 m r) 

*® = -E^ G °ut(t,bi(t),6»V) (85a) 

1=1 
1 71 r) 

A(m) = - V^Gn^d^),^). (85b) 
rn ^-^ ad J 

3=1 

Here, we interpret the derivatives as matrices G W ndXrib 
and A(t+1) G R n ^ xn d. The recursion is initialized with t = 0, 
Vi(t— 1) = and some values for Uj(0). 

Now similar to Section |V| consider a sequence of random 
realizations of the parameters indexed by the input dimension 
n. For each n, we assume that the output dimension m = 
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m(n) is deterministic and scales linearly as in ( [45] ) for some 
/3 > 0. Assume that the transform matrix A has i.i.d. Gaussian 
components ~ A/"(0, 1/m). Also assume that the following 
components converge empirically with bounded moments of 
order 2k — 2 with the limits 



lim 9\ 

n— ^oo 

lim Ui(0) 



lim 0? 



U , 



(86a) 
(86b) 



for random variables 6 V , U and Uq. We also assume Uo is 
zero mean. 

Under these assumptions, we will argue that the SE equa- 
tions for the vector- valued recursion are given by: 

K d (t) 

:= E[G out (t,B(t),^)G out (t,B(t),^) T ] (87a) 
K 5 (m) 

:= /?E [G in (t, D(t), n )G in (t, D(t), U ) T ] (87b) 

where the expectations are over the random variables U and 
6 V in the limits ( [86] ), and B(t) and D(t) are Gaussian vectors 

B(t) - A/*(0, K 5 (t)), D(t) - A/"(0, K d (t)) (88) 

independent of U and # v . The SE equations ( [87] ) are initialized 
with 

K 6 (0) := (3E [U U^] . (89) 

The matrices in ( [85] ) are derived empirically from the 
variables h(t) and d(t). We will also be interested in the case 
where the algorithm directly uses the expected values 

d 



m 

\(t+i) 



db 

d_ 
dd 



G in (tMt),o v ) 



(90a) 
(90b) 



where, again, the expectations are over random variables U 
and 6 V in the limits ( [86] ), and B(t) and D(t) are Gaussian 
vectors in ( [88] ) independent of # M and 

Claim 2: Consider the recursion in ( [83] ) and ( [84] ) with either 
the empirical update ( [85] ) or expected update ( [90] ). Then, under 
the above assumptions, for any fixed iteration number t, the 
variables in the recursions with either the empirical or expected 
update converge empirically as 



lim (dj(t),01 



(D(t),n 



lim(b^),^) 4 (B(*),6>"), 



(91a) 
(91b) 



where are 6 U and U are the random variables in the limits 
( [86] ), and B(t) and D(t) are Gaussians ( [88] ) independent of 
U and (9 V . 

The scalar case when n& = = 1 is rigorously proven by 
Bayati and Montanari (71. The modifications for the vector- 
valued case is straightforward but tedious. However, we only 
provide a sketch of the arguments in Appendix [F] As discussed 
above, a full re-derivation of the proof from \7\ would be long 
and beyond the scope of the paper. Thus, the result is not fully 
rigorous, and we use the term Claim instead of Theorem to 
emphasize the lack of rigor. A complete proof would be a 
valuable avenue of future work. 



Appendix B 
Proof of ClaimQ] 

Claim [T] is a special case of the general result, Claim [2] 
above. Although we have not provided a complete proof of 
Claim [2] the implication from Claim [2] to [T] is completely 
rigorous. 

Let ri5 = 2 and rid = 1 and define the variables 



Vi(t) 

djit) 
9] 

A(m) 

m 



4) J • b - (() 

=F7x(rj(t) -a r {t)xj) 


T*(t) 
1 



Zi 



r r (t) 



a r {t) -1] 



(92a) 
(92b) 
(92c) 
(92d) 
(92e) 

(92f) 



Also, for 6 U 
functions 



(x, q), 9 V = w and b = (z p) T , define the 
G in (t,d,O u ) 



and 



1 (t,T r (t)d + a r (t)x,q,T r (t)) 



G out (t, b, 9 V ) := g out (t,p, h(z, w),T p (t)). 



(93a) 



(93b) 



With these definitions, it is easily checked that simplified 
GAMP algorithm, Algorithm [2] with the modifications in 
Section |V-D| ag rees with the recursion described by equations 
( [83] ), and (|84|), with X(t) being defined with the empirical 
update in ( [85] ) and £(t) being defined by the expected value 
in ([90]). For example, 



hi(t) 



(a) 



(c) 



m 



Xj 







_ Xj(t) _ 







^aijUjfr) - \(t)vi(t-l), 



where (a) follows from the definition of bi(t) in ( |92a| ); (b) 
follows and ([9]) and the fact that z = Ax and (c) follows from 
the remaining definitions in ( [92] ). Therefore, the variables in 
([921 satisfy ([83a]). Similarly, 



(a) 



— (rj^-ar^Xj) 



where (a) follows from the definition of dj(t) in ( |92c| ); (b) 
follows from ( |llb| ) with the modification that r r (t) is replaced 
with r r (t)\ and (c) follows from the other definitions in ([92]) 
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Hence the variables in ( [92] ) satisfy ( |83b| ). The equations in 
( [84] ) can also easily verified. 

We next consider \(t) and £(t). For X(t), first observe that 
d92cb shows that 



d_ 
dd 



g in (£, r r (t)d + a r (t) a^- , ^ , r r (t) ) | 



d=dj(t) 



(94) 



from ( |87a| ). Similarly, one can show that, if ( |97b| ) holds for 
some t, then ( |97a| ) holds for £ + 1. 

With these equivalences, we can now prove the assertions in 
Claimfll To prove ([56]), first observe that the limit ( |9T] ) along 
with ( [96] ) and the definitions in ( [92] ) show that 



lim 



r r (t) 



{r j {t)-a r {t)x 3 \x 3 ,q j = (D(t),X,Q), 



Hence 



1 n 



7*(t+l) ^ - 

m 
(95) 



where the limit is in the sense of empirical convergence of 
order k and D(t) ~ A/*(0, K d (£)) is independent of (X,Q). 
r 1^utftttis limit is equivalent to 



where (a) follows from ( |9a| ) with the modification that ||A|||i 
has been replaced by its expectation E||A|||i = n and (b) 
follows from ( |12b| ) with the modification that r r (t) is replaced 
by r r (t). Combining ( [94] ) and ( [95] ), with the definition of 
Gi n (-) in ( ]93a| ), we see that X(t) in ( |92e| ) satisfies ( |85b| ). 
Similarly, for observe that (|53d|) and (|53b|) show that 



E 



E 



g out (t,P,h(z,W),l*{t)) 



d_ 
dz 

^g out (t,p,h(Z,W),T p (t)) 



z=Zl 



a r (t) 
-1 



Combining these relations with the definition of G out (-) in 
( [93b| ), shows that £(t) defined in ([921) satisfies ( [90] ). 

Therefore, we can apply Claim [2] which shows that the 
limits ([91} hold for the matrices K b (t) and K d (t) from the SE 
equations ( [87] ) with initial condition ( [89] ). Since the definitions 

j, the expectations in ([87]) 



(9 V - jy, 



in ([92]) set 0V = q 5 ) and 0? 
are over the random variables 

U ~(X,Q), 



(96) 



where (X,Q) and W are the limiting random variables in 
Section EH 

Now using the SE equations ( [52] ), ( [53] ) and ( [54] ), one can 
easily show by induction that 



K p (t) = K b (t) =/3K x (t) 



(97a) 
(97b) 



For example, one can show that ( |97a| ) holds for t = by 
comparing the initial conditions ( [89] ) with ( [52] ) and using the 
definition of u(0) in ( |92a| ). Now, suppose that ( |97a| ) holds 
or some t. Then, ( [97a] ) and ([96]) show that (B(t),0 v ) in the 
expectation ( |87b| ) is identically distributed to ((Z, P), W) in 
O p (K?(t)). Therefore, 



g 2 out (t,Y,P,T p (t)) 



(T r (t)) 2 E [G 2 out (tMt),0 v )] 
(r r (t)) 2 K d (t), 



(c) 



where (a) follows from ( |53c| ); (b) follows from the definition 
of b(t) and V in ([92]) and G ou t(-) in ( [93b] ); and (c) follows 



lim (xj,qj,?j{t)) = (X,Q,P), 



(98) 



where 



P = a r (£)X + r r (£)£>(£). 

Since £>(t) - A/"(0, K d (t)), 

r^)D(t) ^AT(0,r(t)) 2 K d (t)) =^(0,rW), 

where the last equality is due to ( |97b| ). Therefore, (X, Q, P) 
in ([98} is identically distributed to (9 r (^ r (t), a r (t)) in gg). 
This proves ( [56] ), and part (b) of Claim [T] Part (c) of Claim [T] 
is proven similarly. 

To prove ( [55] ) in part (a), 



lim 

n— )-oo 
(d) 



(a) 



T^) 



lim r s (t) 



lim — 

n— )>oo m 



9 



t (*,£(*), Vi,T*{t)) 



d_ 
dp 



3out (t,P,Y,TP(t)) 



r r (t) 



where (a) follows from ( |lla| ) with the modification that 
|| A Hl^ = n, (b) follows from ( |10b| ) and that fact that we are 
considering the modified algorithm where r p (t) is replaced 
with r p (t)\ (c) follows the limit ( [57] ) and the assumption 
that the derivative of g ou t(t : p 7 y, f p (t)) is of order fc; and (d) 
follows from ([53]). Similarly, one can show 



lim t p (£) =r p (t). 

n— )*oo 

This proves ( [55] ) and completes the proof of Claim [T] 



Appendix C 
Max-Sum GAMP 

In this section, we show that with the functions g- in and g out 
in fl8] ) and ( [24] ), the GAMP algorithm can be seen heuristically 
as a first-order approximation of max- sum loopy BP for the 
MAP estimation problem. The derivations in this section are 
not rigorous, since we do not formally claim any properties 
of this approximation. 
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Fig. 5. Factor or Tanner graph for the linear mixing estimation problem. 

A. Max-Sum BP for MAP Estimation 

We first review how we would apply standard max- sum 
loopy BP for the MAP estimation problem ( [16] ). For both 
the MAP and MMSE estimation problems, standard loopy 
BP operates by associating with the transform matrix A a 
bipartite graph G = (V,E) called the factor or Tanner graph 
as illustrated in Fig. [5] The vertices V in this graph consists 
of n "input" or "variable" nodes associated with the variables 
Xj, j = 1, ... ,n, and m "output" or "measurements" nodes 
associated with the transform outputs Zi, i = 1, . . . , m. There 
is an edge G E between the input node xj and output 

node Zi if and only if ^ 0. 

Now let Aj(xj) be the marginal maxima in flT] ), which we 
can interpret as a "value" function on the variable Xj. Max- 
sum loopy BP iteratively sends messages between the input 
nodes Xj and output nodes Z{ representing estimates of the 
value function Aj (xj ) . The value message from the input node 
Xj to output node Z{ in the tth iteration is denoted Ai«_j (t, xf) 
and the reverse message for z\ to Xj is denoted A^j(t, Xj). 

Loopy BP updates the value messages with the following 
simple recursions: The messages from the output nodes are 
given by 

A^j(t, Xj) = const 

+ max/ ou t(zi,2/i) + V"A^ r (£,x r ) (99) 

X z — ' 

where the maximization is over vectors x with the jth com- 
ponent equal to Xj and Zi — x, where is the zth row 
of the matrix A. The constant term is any term that does not 
depend on Xj, although it may depend on t or the indices i 
and j. The messages from the input nodes are given by 

An-j(t+l,Xj) = const 

) + ^A i -+ j (t,x j ) (100) 

The BP iterations are initialized with t = and 
A i _ j (-l,x j ) = 0. 

The BP algorithm is terminated after some finite number of 
iterations. After the final tth iteration, the final estimate for Xj 
can be taken as the maximum of 



Aj{t+l,Xj) = f in (xj, qj ) + ^2Ai^j(t,Xj). 



(101) 



B. Quadratic Legendre Transforms 

To approximate the BP algorithm, we need the following 
simple result. Given a function / : R — >> R, define the 
functions 



(Lf)(x,r,r) 
(r/)(r,r) 
(A/)(r,r) 

(A«/)(r,r) 



(102a) 

:= argmax(£/)(;r,r, r) (102b) 

a: 

max(L/) Or,r,r) (102c) 

^(A/)(r,r), (102d) 

over the variables r, r G R with r > and fc = 1,2, — The 
function A/ can be interpreted as a quadratic variant of the 
standard Legendre transform of / (69). 

Lemma 1: Let / : R — )> R be twice differentiate and 
assume that all the maximizations in ( |102| ) exist and are 
unique. Then, 



(A«/)(r,r) 
(A( 2 )/)(r,r) 



/"(£) 



<9r 



l-rf(i)' 
1 

l-rf(i) 



(103a) 
(103b) 
(103c) 



where x = (T/)(r, r). 

Proof: Equation ( |103a| ) follows from the fact that 



(A (1) /)(r,r) 
Next observe that 



<9L(r, x) 



<9r 



<9 2 
<9 2 

-L(x, r) 



/"(*) - - 

T 

1 



<9x<9r r 
So the derivative of x is given by 



dx 2 
l/r 



L(x, r) 



a 2 



dxdr 
1 



L(x, r) 



l/r-/"(x) l-r/"(a;) ! 
which proves ( |103c| ). Hence 

(A (2) /)(r,r) = | : (A( 1 )/)(r,r) 
1 



t Vl -rf"{x) 
which shows ( |103b| ). 



1 = 



l-rf(J)' 



C. GAMP Approximations 

We now show that the GAMP algorithm with the functions 
in ([18]) and ( [24] ) can be seen as a quadratic approximation of 
the max-sum loopy BP updates ( |100| ). The derivation is similar 
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to the one given in |70| for the Laplacian AMP algorithm [6]. and if (t) be given as in ( [5a] ). Then it follows from ( |108| ) that 
We begin by considering the output node update ([99]). Let 



Xj(t) 

x i^j W 
1 

1 



arg max Aj(t, Xj), 
arg max Ai«_j (t , ) , 



Pi^jit) = pi(t) - aijXi^j(t) (Ula) 
*f-;(*) = rfW-a^W. (111b) 



(104a) 
(104b) 

Using ( |111| ) and neglecting terms of order O(a^), ( |107| ) can 
(104c) be further approximated as 



. (104d) 



1^3 



Now let 



Now, for small a ir , the values of x r in the maximization ([99]) 
will be close to xn-j(t). So, we can approximate each term 
Ai4- r (t,x r ) with the second order approximation 



(t,Xj) « H (pi(t) + aij(xj -Xj),yi,7f(t)) + const. 

(112) 

gout(p,y,r*) := -j^H(p,y,T*). (113) 



A^^_ r (t,x r ) ~ A^^_ r (t, X{i— r (£)) 



i 



2t?(*) 



(X r X^^_ r (t)) 



(105) 



Comparing iJ(-) in ( |109| ) with the definitions in ( |102| ) and 
using the properties in ( |103| ), it can be checked that 

ff(p,2/,rP) = (A/ out (.,j/))(p,rP), 



where we have additionally made the approximation . . 

r^ r (t) « Tf (t) for all i. Now, given Xj and z if consider and that WO- in <TT3» and its derivative agree with the 

definitions in (|24j) and ([27]). 

Now define si(t) and rf(t) as in ([6]). Then, a first order 
(106) approximation of ( |112| ) shows that 



the minimization 



subject to 



J := min^ — (a; r - x^ r ) 2 , 

(Xr^j X j | ^ (Xj^f X f . 



A^j(t, Xj) w const 

+s i (t)a ij (x j - Xj(t)) 



' L j\ L JJ 2 a ij( X j X j(t)) • 



A standard least squares calculation shows that the minimiza- 
tion ( |106| ) is given by 

J = 9 ^ ] ^~~p — Pi-*j ~~ a ij x j) 



const [si(t)ajj + a^r?(<;)Sj(<;)] x 



(114) 



where 



where again the constant term does not depend on xj . We next 
consider the input update ( [100] ). Substituting the approxima- 
tion ( |114| ) into ( |100| ) we obtain 

Ai«_j(t+1, Xj) « const 



+ fin(Xj,qj) 



So, the approximation ( ]105| ) reduces to 

Ai^j{t, Xj) 



2T r {t) (r t ^(t)- Xj ) 2 , (115) 



max 



/out (^ij 



1 



where 



2*f-,-(t) 
+const 

a i j ^ j 5 2/i 7 r fU j (t) ) + const 
rU 3 {t) ■= £MV(*)> 



( Pi — >■ j ( ^ ) ft z j ^ j ) 



2 



where the constant term does not depend on Xj and 
1 



(116a) 



(107) 



x j (t) + ^ w ^(*) a ^ • (1 16b) 



(108a) Now define 0. n ( r g? r r) as in (Tfgj Then? using ([115]) an d ([18), 
(t) in ( |104b| ) can be re- written as 



and 



H(p, y-, r p ) := max 



fout(z,y) - t^(z-p) 2 



(108b) 



(109) 



(117) 



Then, if we define f}(t) and rj(t) as in ([7]), fi<_j(t) and 
t 4 ^_ ■ (t) in ( |116| ) can be re-written as 



The constant term in ( |107| ) does not depend on Z{. Now let 

Pi(t) ^aijXi^jit), (110) 



rUt). 



(118a) 



fi-W-rJ^Oi^iW, (118b) 
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where in the approximations we have ignored terms of order 
0(af •). We can then simplify ( |117| ) as 



(a) 



(t+1) 

Xjit+V-OijSjWDjit+l) 



(119) 



where (a) follows from substituting ( |118| ) into ( |117| ) and (b) 
follows from a first-order approximation with the definitions 



Now 



%(m) 

Dj(t+1) 



(120a) 



9in (rj(t),qj,Tj(t)) 
T H t )^9in(r j (t),q j ,rJ(t)). (120b) 



«^(*)^( r /in(^0)(ri(t),rj(t)) 



(d) 



%(*+!)) 



*f(*+l), 



(121) 



where (a) follows from ( |120b| ) and by comparing ( [18] ) to 
( fT02b] ); (b) follows from ( fT03c] »; (c) follows from fTB) and (d) 
follows from ( [T04d| . Substituting ( [121] into ( fTl9] > and ( [TTU| , 
we obtain 

Pi(t) = ^OijXj^-if^Siit-l), (122) 

which agrees with the definition in ( [5c] ). In summary, we 
have shown that with the choice of gm(') and g ou t(-) in jjj) 
and ( [24] ), the GAMP algorithm can be seen as a quadratic 
approximation of max- sum loopy BP for MAP estimation. 



Appendix D 
Sum-Product GAMP 

A. Preliminary Lemma 

Our analysis will needs the following standard result. 
Lemma 2: Consider a random variable U with a conditional 
probability density function of the form 

Pu\v(u\v) := -— ^ exp (</> (u) + uv) , 
Z[y) 

where is a normalization constant (called the partition 

function). Then, 



d_ 

dv 

dv 2 



\ogZ(v) = E(U\V = v) 
d 



\ogZ(v) 



dv 



(123a) 

E(U\V = v) (123b) 



B. Sum-Product BP for MMSE Estimation 

The sum-product loopy BP algorithm for MMSE estimation 
is similar to max-sum algorithm in Appendix [C] For the sum- 
product algorithm, the BP messages can also be represented as 
functions A^_^-(£, Xj) and A^_j(£, However, these mes- 
sages are to be interpreted as estimates of the log likelihoods 
of the variables Xj , conditioned on the system input and output 
vectors, q and y, and the transform matrix A. 

The updates for the messages are also slightly different 
between the max-sum and sum-product algorithms. For the 
sum-product algorithm, the output node update ( [99] ) is replaced 
by the update equation 

Ai^j(t,Xj) = logE(pY\z(yi,Zi)\xj) + const, (124) 

where the expectation is over the random variable Zi = x 
with the components x r of x being independent with distri- 
butions 

Pi4- r {x r ) ex exp Ai«_ r (t, x r ). (125) 

The constant term in ( |124| ) can be any term that does not 
depend on Xj . The sum-product input node update is identical 
to the max-sum update ( |100| ). Similar to the max-sum esti- 
mation algorithm, sum-product BP is terminated after some 
iterations. The final estimate for the conditional distribution 
of xj is given by 

Pj (xj ) oc exp Aj (t, xj ) , 

where Aj(t,Xj) is given in ( |101| ). From this conditional 
distribution, one can compute the conditional mean of Xj. 

C. GAMP Approximation 

We now show that with the gm(-) in ( [3T] ) and g ou t(') in 
([36]), the GAMP algorithm can heuristically be regarded as a 



Gaussian approximation of the above updates. The calculations 
are largely the same as the approximations for max-sum BP 
in Appendix [C] so we will just point out the key differences. 
We begin with the output node update ( |124| ). Let 



1^3 



i(t) 
,•(*) 

(t) 



E^-IA,- (*,•)], 
E[a;,- 1 A^- (*,•)], 

|A,-(t,-)] 



v 3\ 

varfx 



var[^|A^(t,-)], 



(126a) 
(126b) 
(126c) 
(126d) 



where we have used the notation E(g(x)\A(-)) to mean the 
expectation over a random variable x with a probability density 
function 

Pa (#) oc exp (A(x)) . (127) 

Therefore, xn_ r (t) and r^_ r (t) are the mean and variance 
of the random variable x r with density ( |125| ). Now, the 
expectation in \\2A) is over zi = &J x with the components x r 
being independent with probability density ( |125| ). So, for large 
n, the Central Limit Theorem suggests that the conditional 
distribution of Z{ given Xj should be approximately Gaussian 



= war(U\V = v). (123c) z% „ A/*(p^-(£), t£_ .(*)), where p^(t) and 7*(t)) are 



Proof: The relations are standard properties of exponen- 
tial families |23l. ■ 



defined in ( |108| ). Hence, Ai->j(t,Xj) can be approximated as 



(t,^-) « H (p^(t),2/i,rf (t)) , 



(128) 
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where 



H{p,y,r p ) :=logE [py\ z (y\z)\p, r p ] , (129) 



and the expectation is over random variable z ~ A/*(p, r p ). It 
is easily checked that 



H{p, y, r p ) := log p(z\p, y, r p ) + const, 



(130) 



where p(z\p, y, r p ) is given in ( [35] ) and the constant term does 
not depend on p. 

Now, identical to the argument in Appendix IV-A| we can 
define % (t) as in ( | 1 1 Q| > and r p (t) as in ( [5a) so that A^j (t , ) 
can be approximated as ( |112| ). Also, if we define #out(p, 2/? 
as in ( |113| ), and si(t) and rf(t) as in ([6]), then s~i(t) and 
r/(t) are respectively the first and second-order derivatives 
of H(pi(t), yi,r p (£)) with respect to p. Then, taking a second 
order approximation of ( |112| ) results in ( |114| ). 

Also, using ( |13Q| ), g ut(') as defined in ( | 1 1 3| > agrees with 
the definition in ( [38] ). To show that g ou .t{ m ) is also equivalent 
to ([36]) note that we can rewrite H(-) in ( |130| ) as 



H(p,y,r p ) := log 



Z(p,2/,rP) 



where 



2r^ 



Z (p,y,r p ) := y exp 

fout(z,y) + 



(2pz - z 2 ) 



/ 



exp 



2r^ 



(2pz - z 2 ) 



Then the relations ( [36] ) and ( [37] ) follow from the relations 
( [T23) . 

We next consider the input node update ( |1QQ| ). Similar to 
Appendix [C] we can substitute the approximation ( |114| ) into 
( |100| ) to obtain ( |115| ) where r^j(t) and r£_-(t) are defined 
in ( |TT6l ). 

Using the approximation ( |115| ) and the definition of Fi n (-) 
in ( [T9) , the probability distribution p&(xj) in ( |127| ) with A = 

Ai<-j(t,Xj) is approximately 

PA^-(t,.)Oj) 

« - exp F in Oj , r*<_ j (t) , ^ , (t) ) , 

where Z is a normalization constant. But, based on the form of 
Fi n (-) in ( [T9) , this distribution is identical to the conditional 
distribution of X in r (r r ) in ([48]) given Q) = (r,q). 
Therefore, if we define gm(') as in ( |3T) , it follows that Xi^j(t) 
in ( |126b| ) satisfies ( |117| ). The remainder of the proof now 
follows as in the proof of Appendix [D] 

Appendix E 
Proof of (1781 



Define 



D(K P ) := E 
d 



d_ 
dp 



g out (t,h(Z,W),P,T p (t)) 



E 



dz 



g out (t,h(Z,W),P,T p (t)) 



(131) 



where the expectation is over (Z,P) ~ A/*(0, K p ) and ~ 
Pw(^) independent of (Z, P). We must show that when K p is 
of the form (73) and # ou t(-) is given by ([36), then D(K7) = 0. 
To this end, let Q be the two-dimensional random vector 

Q= [Z P] T ~A/"(0,K?) 

and let G e R lx2 be the derivative 

d 



(132) 



E 



<9q 



5out (a(^^),P,T P (t)) 



which is the row vector whose two components are the partial 
derivatives with respect to z and p. Thus, D(K P ) in ( |131| ) can 
be re-written as 



1 



-1 



(133) 



Also, using Stein's Lemma (Lemma [4] below) and the covari- 
ance ([132), 

E[Qg out (t,h(Z,W),P,¥>(t)j 
= E [QQ T ] G T = K P G ,T . 
Applying this equality to ( |133| ) we obtain 

D(K p ) :=E [<P(Z,P)g out (t,h(Z,W),P,r p (t))\ , (134) 
where </>(•) is the function: 

When K p is of the form ([73), then it is easily checked that 



j*(ty 



(135) 



Therefore, 



D(K P ) 



(b) 



(a) 1 



r p (t) 



E 



Pg out (t,h(Z,W),P,T?(t)) 



(c) 1 

T*(t) 
1 



r*(t) 



d 



P^logp YlP (Y\P) 



where (a) follows from substituting ( |135| ) into ( |134| ); (b) 
follows from ( [74) ; (c) follows from taking the derivative of 
the logarithm. This shows that D(K P ) = and proves ( [78) . 



Appendix F 
Proof Sketch for Claim[2] 

As stated earlier, Bayati and Montanari in [7 ] already proved 
the result for scalar case when = = 1. Only very minor 
modifications are required for the vector-valued case, so we 
will just provide a sketch of the key changes. 

For the vector case, it is convenient to introduce the notation 
b(t) to denote the matrix with columns bi(t): 

- bx(t) r 



b(t) := 
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We can define u(t), v(t) and d(t) similarly. Then, in analogy 
with the definitions in (Tj, we let 



x(t) := d(i) 
y(t) := b(t) 

and define the matrices 



■ u(t)£(t) T e R n> 
v(t-i)X(t) T G 



x(t) 

Y(t) 

u(t) 
V(t) 



^rixtnd 
vmxtrid 



= [x(0)|...|x(t-i)]e] 
= |y(0)|---|y(t-l)]e: 
= [u(0)|..>(t-i)]€: 

= [v(o)|--.|v(t-i)]e: 

The updates ( [83] ) can then be re-written in matrix form as 

X(t) = A T V(t), Y(t) = AU(t). (136) 

Next, also following the proof in [7], let V||(t) be the 
projection of each column of v(t) onto the column space of 
Y(t), and let v±(t) be its orthogonal component, vj_(t) = 
v(t) — vi I (t). Thus, we can write 



t-i 



V||(t) = ^v(i)a,(t), 



(137) 



i=0 



for matrices ot^t) G R n ^ xnd . Similarly, let U||(t) and u_l(£) 
be the parallel and orthogonal components of u(t) with respect 
to the column space of U(t) so that 



u||(t) = ^u(i)AW, 



(138) 



i=0 



where e R n ^ xn ^. 

Now let 6(^1,^2) be the sigma algebra generated by 
b(0),...,b(*i), v(0),...,v(ti), d(0),...,d(£ 2 - 1) and 
u(0), . . . , 11(^2 )• Also for any sigma- algebra, £5, and random 
variables X and Y, we will say that X is equal in distribution 
to Y conditional on if E(0(X)Z) = E(<fr(Y)Z) for any Z 
that is 0-measurable. In this case, we write X\& =Y. 

With these definitions, the vector- valued analogy of the 
main technical lemma (7J Lemma 1] can be stated as follows: 

Lemma 3: Under the assumptions of Claim [2] the following 
hold for all t > 0: 

(a) The conditional distribution of d(t+l) is given by 



d(t)\ 



b(t)\ 



<5(t,t) 
t-1 



d(i)cti(t) + A T v ± (t) + U(*)^ t (l) 



0(t-l,i) 
t-1 



± ^>(i)A(t) + Au ± (i)+V(i)^ t (l) 

2=0 

where A is an independent copy of A, and the matrices 
OLi(t) and (3i(t) are the coefficients in the expansion ( |137| ) 
and ( | 1 3 8| >. The matrix U(t) is such that its columns form 
an orthogonal basis for the column space of U(t) with 
U(t) T U(£) = nl tnd . Similarly, V(t) is such that its 
columns form an orthogonal basis for the column space 



of V(t) with Y(t) T Y(t) = ml tnh . The vector ~^ t (l) 
goes to zero almost surely, 
(b) For any pseudo-Lipschitz functions (j)d : R nd ( £+1 ) x 
6 n R and (j) h : R^(t+i) x 0M M: 



1 n 

lim -X;^(di(0),-" .^(t),^) 



E[^(D(0),...,D(t),6» u )] 
lim -^^(O),-..,^^),^) 



n— )*oo m 

= E[0 6 (B(O),...,B(*),0 W )] 

where the expectations are over Gaussian random vectors 
(D(0) , . . . , D(t)) and (B(0), . . . , B(t)) and the random 
variables # n and # v in the limit ( [86] ). The variables U and 
U are independent of B(r) and D(r) and the marginal 
distributions of B(r) and D(r) are given by ( [88] ). 
(c) For all < r, s < t, the following limit exists hold are 
bounded and are non-random 



1 n 

lim - Vd 7 (r)d 7 (s) T 
= lim — V" v i (r)v i (s) T 

ra— >-oo 772 z — * 

2=1 

lim -Vb,(r)b,(s) T 

rwoo m z — ' 
i=l 

1 U 



(d) Suppose <p d : R nd x 9 n ^ R and <p 6 : R Ub x6MM 
are almost everywhere continuously differentiable with 
a bounded derivative with respect to the first argument. 
Then, for all all < r, s < t the following limits exists 
are bounded and non-random: 



1 n 

3 = 1 

1^00 n z — ' 

^ m 

lim — V"b i (r)(^ 6 (b j (s)) T 

lim - Vb l (r)b i (s) r G fc (s) 2 

7,-)-no m f * 



ra— >-oo m 



where Gd(s) and G&(s) are the empirical derivatives 

1 n r) 

G d (s) := lim ^(d,( S ),n 
1 m f) 

G b (s) := lim -V- W (b,( S ),9"). 
n^oo m ab 

2 = 1 
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(e) For t = k — 1, the following bounds hold almost surely 
lim ^f>^)H 2 

J = l 

m— >-oo 777, z — ' 



I 2^ 



I 2^ 



< OO 



< OO 



This lemma is a verbatim copy of (71 Lemma 1] with 
some minor changes for the vector- valued case. Observe that 
Claim[2]is a special case of part (b) of Lemma[3]by considering 
functions of the form 



b d (d(o),...,d(t),e u ) 

h(h(O),...,b(t),0 u ) 



i(d(t),e u ) 



That is, we consider functions that only depend on the most 
recent iteration number. 

The proof of Lemma [3] also follows follows almost iden- 
tically to the proof of the analogous lemma in the scalar 
case in IT). The key idea in that proof is the following 
conditioning argument, originally used by Bolthausen in |4T| : 
To evaluate the conditional distributions of d(t) with respect 
to and b(t) with respect to <8(t-l,£) in part (a) 

of Lemma |3j one evaluates the corresponding conditional 
distribution of the matrix A. But, this conditional distribution 
is precisely identical to the distribution of A conditioned on 
linear constraints of the form ( |136| ). But, this distribution is 
just the distribution of a Gaussian random vector conditioned 
on it lying on an affine subspace. That distribution has a simple 
expression as a deterministic offset plus a projected Gaussian. 
The detailed expression are given in (7J Lemma 6] and the 
identical equations can be used here. 

The proof uses this conditioning principle along with an 
induction argument along the iteration number t and the 
statements (a) to (e) of the lemma. We omit the details as 
they are involved but follow with only minor changes from 
the original scalar proof in (7). 

The only one other non-trivial extension that is needed is 
the matrix form of Stein's Lemma, which can be stated as: 

Lemma 4 (Stein's Lemma /[771/J: Suppose Zi G W 11 and 
Z 2 G M n2 are jointly Gaussian random vectors and cp : R n2 — » 
R ns is any function such that the expectations 



G :=E 



_d_ 
dz 2 



y(z 2 ) 



and 



exists. Then, 



E(Zk^(Z 2 ) t ) G 



pixn 3 



E(Z 1 ( / 9(Z 2 ) T ) = E ((Zx - Zi)(Z 2 - Z 2 ) T ) G T , 

where Z$ is the expectation of Z$. 

Note that part (d) of Lemma [3] is of the same form of this 
lemma. 
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